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Langevin Dynamics, Monte Carlo, and all-atom Molecular Dynamics simulations in 
implicit solvent, widely used to access the microscopic transitions in biomolecules, re- 
quire a reliable source of random numbers. Here we present the two main approaches 
for implementation of random number generators (RNGs) on a GPU, which enable 
one to generate random numbers on the fly. In the one-RNG-per-thread approach, 
inherent in CPU-based calculations, one RNG produces a stream of random numbers 
in each thread of execution, whereas the one-RNG-for-all-threads approach builds 
on the ability of different threads to communicate, thus, sharing random seeds across 
the entire GPU device. We exemplify the use of these approaches through the devel- 
opment of Ran2, Hybrid Tans, and Lagged Fibonacci algorithms fully implemented 
on the GPU. As an application-based test of randomness, we carry out LD simu- 
lations of N independent harmonic oscillators coupled to a stochastic thermostat. 
This model allows us to assess statistical quality of random numbers by comparing 
the simulation output with the exact results that would be obtained with truly ran- 
dom numbers. We also profile the performance of these generators in terms of the 
computational time, memory usage, and the speedup factor (CPU/GPU time). 
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I. INTRODUCTION 

Over the last few years, graphics processors have evolved into highly parallel, multithreaded 
computing devices. Graphics Processing Units (GPUs) are now emerging as an alternative 
jrogramming platform that provides high raw computational power for scientific applications 
l|-|8|. Because GPUs implement Single Instruction Multiple Data (SIMD) architecture with 
reduced cache and flow control for a group of computational cores, most of a GPU device 
form computational units dedicated to actual calculations. The computational efficiency of 
contemporary GPUs can reach striking 1 TFlops for a single chip [Qi], the number not yet 
accessible even for most up-to-date CPUs. With introduction of CUDA (Compute Unified 



Device Architecture) by NVIDIA (a dialect of C and C+-|- programming languages) 



GPUs have become capable of performing compute-intensive scientific calculations. Because 
the CPU-based calculations are 10—50 times faster than some of the heavily tuned CPU-based 
methods, GPUs are being used as performance accelerators in a variety of applications jl, 2, 7, 8|. 

The CPU-based calculations can be performed concurrently on many computational cores, 
called Arithmetic Logic Units (ALUs), that are grouped into multiprocessors, each with its 
own fiow control and cache units. For example, in contemporary graphics cards from NVIDIA 
each multiprocessor contains up to eight r^l.3GHz ALUs and 14— 16KB of cache {8KB of 
constant memory cache and 6— 8KB of the global memory cache when accessed through texture 
references). The number of multiprocessors per GPU can reach 30 on the most up-to-date 
graphics cards (Tesla C1060 or GeForce GTX 285), thus, bringing the total number of ALUs to 
240 per chip. Due to the inherently parallel nature of the CPU-based calculations, achieving 
optimal performance on the GPU mandates that a computational task be divided into many 
independent threads of execution that run in parallel performing the same operation but on 
different data sets. Although each graphics card has its own global memory with ~10 times 
larger bandwidth compared to DRAM on a CPU, the number of memory invocations (per ALU) 
should be minimized to optimize the GPU performance. Hence, the computational task should 
be compute-intensive so that, most of the time, the GPU is busy performing computations rather 
than reading and writing data j^, [l^. This makes a classical body problem that would be 
difficult or impossible to solve exactly into a prime candidate for the numerical implementation 
on the GPU. 

Computer simulations of a system of particles, e.g., Langevin Dynamics (LD), Monte Carlo 
(MC), and Molecular Dynamics (MD) simulations, are among many applications that can be 
implemented on the GPU. In LD and MD simulations, atomic interactions are described by 
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the same potential energy function (force field) applied to all particles in the system. Hence, 
there is a direct mapping between the SIMD architecture of the GPU (hardware) and numerical 
routines (software) used to follow a trajectory of the system under the study in real time. In 
a sense, a "single instruction", i.e. calculation of the potential energy terms or evaluation of 
forces and random forces, or numerical integration of the equation(s) of motion, is executed on 
"multiple data" sets (for all particles) in order to describe the dynamics of the whole system. For 
example, in MD simulations of biomolecules in implicit solvent (water) , the dynamics of 

the i-th particle are governed by the equations of motion for the particle position, dIii/dt=V'i 
{Ri={Ri^^,Ri^y,Ri^^}), and velocity, midYi/dt=^Yi+f(Ri)+Gi{t) (yi={Vi^x,Vi^y,Vi^;,}), where 
rrii is the particle mass, ^ is the friction coefficient, f(Ri) = —dXJ/dHi is the molecular 
force exerted on the i-th particle due to the potential energy U=U(Ri, R2, . . . , Rn), and 

Gi{t)={Gi^x,Gi^y,Gi^z} is the Gaussian random force with the first moment (Gj(t))=0 and 

i 

the two-point correlation function (Gi(t)Gj(t')) = 2kBT^6ij6{t - t') (i, j=l,2, . . . , iV) [isi]. In 
LD simulations of proteins, the dynamics of the z-th Co-particle is governed by the Langevin 
equation for Rj, ^dKi/ dt=i(R,i)+Gi{t) [l^. These equations of motion are solved numerically 
over many iterations of the simulation algorithm. 

Since in MD simulations in implicit water and in LD simulations, the effect of solvent 
molecules is described implicitly, these methods require a reliable source of 3A^ normally dis- 
tributed random numbers, gi^a (^=1? 2, . . . , A^) to compute the components of the Gaussian ran- 



dom force Gi^a=9i,aV'2^BT^At, where At is the time step {a=x, y, and z). In MC simulations, 
the results of multiple independent trials, each driven by some random process, are combined 
to extract the average answer. A pseudo-random number generator, or algorithmic RNG, must 
have a long period and must meet the conflicting goals of being fast while also providing a large 
amount of random numbers of proven statistical quality 15| . An RNG produces a deterministic 
sequence of random numbers, Ui, that are supposed to imitate realizations of independent uni- 
form random variables from the interval (0, 1), i.e., i.i.d. f/(0, 1). This sequence (uj) is translated 
into the sequence of normally distributed random variables (q,;) using the ziggurat method [l^ 
or the polar method or the Box- Mueller transformation 18|. There is an extensive body 
of literature devoted to random number generation on the CPU [19]. Yet, due to fundamental 
differences in processor and memory architecture of CPU and GPU devices, the CPU-based 
methods cannot be easily translated from the CPU to the GPU. While there exist stand-alone 
implementations of good quality RNGs on the GPU, to fully utilize computational resources of 
the GPU in molecular simulations an RNG should be incorporated into the main simulation 
program. This enables a developer to minimize read/write calls associated with invocation of 
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the relatively slow GPU global memory, and to generate streams of random numbers using fast 
GPU shared memory. 

We present a novel methodology for generating pseudo-random numbers on a GPU on the 
fly, i.e. at each step of a simulation run. This methodology can be used in the development 
of the GPU-based implementations of MD simulations in implicit solvent, and in LD and MC 
simulations. We focus on the Linear Congruential Generator (LCG), and the Ran2, Hybrid 
Taus, and Lagged Fibonacci algorithms reviewed in the next Section. These algorithms are used 
in Section IIIII to describe the one-RNG-per-thread approach and the one-RNG-for-all-threads 
approach for random number generation on the GPU. In the one-RNG-per-thread setting, one 
RNG is assigned for each computational thread (for each particle), a procedure commonly used 
in the CPU-based calculations. The one-RNG-for-all-threads method utilizes the ability of 
different threads to communicate across the entire GPU device (pseudocodes are given in the 
Appendices). We test the performance of GPU-based implementations of these generators in 
Section HVt where we present application-based assessment of their statistical qualities using 
Langevin simulations of independent Brownian particles evolving on the harmonic potential. 
We profile these generators in terms of the computational time and memory usage for varying 
system size N. The main results are discussed in Section |Vl where we provide recommendations 
on the use of RNG algorithms. 

II. PSEUDORANDOM NUMBER GENERATORS 
A. Overview 

There are three types of random numbers generators: true or hardware random numbers 
generators, and software-based quasi-random numbers generators and pseudo-random numbers 
generators (RNGs) [20] . In this paper we focus on algorithmic RNGs - the most common type 
of deterministic random number generators. Because an RNG produces a sequence of random 
numbers in a purely deterministic fashion, a good quality RNG should have a long period and 
should pass some stringent statistical tests for uniformity and independence. MD simulations of 
biomolecules in implicit water and LD simulations of proteins use normally distributed random 
forces to emulate stochastic kicks from the solvent molecules. To generate the distribution of 
random forces, a common approach is to convert the uniformly distributed random variates into 
the Gaussian distributed random variates using a particular transformation. In this paper, we 
adopt the most commonly used Box-Mueller transformation jisl . 
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There are three main requirements for a numerical implementation of an RNG: (1) good 
statistical properties, (2) high computational speed, and (3) low memory usage. A deterministic 
sequence of random numbers comes eventually to a starting point, i.e. to the initial set of random 
seeds Un+p=Un [21] . This mandates that an RNG should have a long period p. For example, 
a simulation run might use 10^^ random numbers, in which case the period must far exceed 
10^^. Once an RNG has been selected and implemented, it must also be tested empirically for 
randomness, i.e., for the uniformity of distribution and for the independence 15|]. In addition, 



it must also pass application-based tests of randomness that offer exact solutions to the test 
applications. Using random numbers of poor statistical quality in molecular simulations might 
result in insufficient sampling, unphysical correlations or even patterns 
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23| . and unrealistic 



results, which leads to errors in practical applications [24,]. Some of the statistical tests of 



randomness are accumulated in the DIEHARD test suite and in the TestUOl library 15 



25|-|27|. 



In numerical implementations, a good quality RNG should also be computationally efficient 
so that random number generation does not become a major bottleneck. For example, in LD 
simulations of proteins on a GPU, one can obtain long 0.1s trajectories over as many as 10^° 
iterations. Hence, to simulate one trajectory for a system of 10^ particles requires total of ~10^^ 
random numbers. The requirement of low memory usage is also important as modern graphics 
processors have low on-chip memory, ~ 20KB per multiprocessor, compared to ~2Mi? memory 
on the CPU. Hence, an efficient RNG algorithm must use limited working area without invoking 
the slow GPU global memory. 

Typically, a fast RNG employs simple logic and a few state variables to store its current 
state, but this may harm statistical properties of the random numbers produced. On the other 
hand, using more sophisticated algorithms with many arithmetic operations or combining several 
generators into a hybrid generator allows to improve statistics, but these generators are slower 
and use more memory. Hence, a choice of RNG is determined primarily by specific needs 
of a particular application, including statistical characteristics of random numbers, and GPU 



capabilities. In thispaper, we focus on the most widely used algorithms: 



Generator (LOG) 19], and the Ran2 [l9(]. Hybrid Taus generator [19, 



Fibonacci algorithm 



19 



30 
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inear Congruential 



28l 



29| and Lagged 



3l| . LCG can be used in performance benchmarks since it employs 



a very fast algorithm. Ran2 is a standard choice for MD simulations of biomolecules in implicit 
water and in LD simulations of proteins due to its long period (p>2xl0^^), good statistical 
quality, and high computational performance on the CPU. However, Ran2 requires large amount 
of on-chip GPU local memory and global memory to store its current state. Hybrid Taus is an 
example of how several simple algorithms can be combined to improve statistical characteristics 
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of random numbers. It scores better in terms of the computational speed on the GPU than 
KISS, the best known combined generator |32|, and its long period (p>2xl0'^^) makes it a good 
choice for molecular simulations on the GPU. Lagged Fibonacci employs very simple logic while 



producing random numbers of high statistical quality [I5|,l3ll]- It is commonly used in distributed 
MC simulations, and it can also be utilized in GPU-based computations. Here, we briefly review 
the LCG, Ran2, Hybrid Taus, and Lagged Fibonacci algorithms. 

B. Linear Congruential Generator 

The Linear Congruential Generators (LCGs) are the classic and most popular class of gen- 
erators, which use a transitional formula, 

Xn = {cLXn-1 + c) mod 171, (l) 

where m is the maximum period, and a=1664525 and c=1013904223 are constant parameters 



19| . To produce a uniformly distributed random number, Xn is divided by 2^"^. Assuming a 
32-bit integer, the maximum period can be at most p=2^^, which is far too low. LCGs also 



have known statistical flows 15|. If m=2^^, one can neglect mod m operation as the returned 
value is low-order 32 bits of the true 64-bit product. Then, the transitional formula reads 
Xn=CLXn-i+c, which is the so-called Quick and Dirty (or ranqd2) generator (simplified LCG 
generator). Quick and Dirty LCG is a very fast generator as it takes only a single multiplication 
and a single addition to produce a random number, and it uses a single integer to describe its 
current state. Due to low memory usage. Quick and Dirty LCG can be used to benchmark 
GPU-based implementations of software packages at the development stage. 

C. Ran2 

Ran2, one of the most popular RNGs, combines two LCGs and employs randomization using 



some shuffling procedure 
good statistical properties 



9|. Ran2 has a long period and provides random numbers of very 
15j. In fact, Ran2 is one of a very few generators that does not fail 



a single known statistical test. It is reasonably fast, but there are several features that make 
Ran2 less attractive for GPU-based computations. First, the algorithm involves long integer 
arithmetic (64-bit logic) - a computational bottleneck for contemporary CPUs. Secondly, it 
requires a large amount of memory to store its current state that needs to be updated at each 
step. This involves a large number of memory calls, which may, potentially, slow down the 
computational speed on the low cache GPU. 



7 



D. Hybrid Taus 

Hybrid Taus [201] is a combined generator that uses LCG and Tausworthe algorithms. Taus- 



worthe taus88 is a fast equidistributed modulo 2 generator [28|, [29], which produces random 
numbers by generating a sequence of bits from a linear recurrence modulo 2, and forming the 
resulting number by taking a block of successive bits. In the space of binary vectors, the n-th 
element of a vector is constructed using the linear transformation, 

Vn = aiUn-i + a2yn-2 + • • • akVn-k, (2) 

where a„ are constant coefficients. Given initial values, yo, yi, ■ ■ ■ yn~i, the n-th random integer 
is obtained as a;„=^^^^?/„s+j_i2^-', where s is a positive integers and L=32 is the integer size 
(machine word size). Computing Xn involves performing s steps of the recurrence, which might 
be costly computationally. Fast implementation can be achieved for a certain choice of param- 
eters. When ak=ag=aQ=l, where 0<2g< k and a„=0 for 0<s<k—q<k<L, the algorithm can 
be simplified to a series of binary operations Statistical characteristics of random numbers 
produced u sing taus88 alone are poor, but combining taus88 with LCG removes all the statisti- 



cal defects 20|. In general, statistical properties of a combined generator are better than those 
of components. When periods of all components are co-prime numbers, a period of a combined 
generator is the product of periods of all components. A similar approach is used in the KISS 
generator, which combines LCG, Tausworhe generator, and a pair of multiple- with-carry gener- 



ators |32|. However, multiple 32-bit multiplications, used in KISS, may harm performance on 
the GPU. The period is the lowest common multiplier of the periods of three Tausworthe steps 
and one LCG. We used parameters that result in periods of pi=2^^ — l, p2=2'^°— 1, and ^3=2^^—1 
for the Tausworthe generators and a period of p4=2^^ for the LCG, which makes the period of 
the combined generator equal ~2^^^>10'^^. Hybrid Taus uses small memory area since only four 
integers are needed to store its current state. 



E. Lagged Fibonacci 

The Lagged Fibonacci algorithm is defined by the recursive relation, 

Xn = f{xn-si, Xn-ii) mod m, (3) 

where si and // are the short and long lags, respectively {ll>sl), m defines the maximum period 
and / is a function that takes two integers Xn-si and Xn-ii to produce integer x„. The most 
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commonly used functions are multiplication, f {xn-si, Xn-ii)=Xn-si*Xn-ii (multiplicative Lagged 
Fibonacci), and addition, f{xn-si,Xn-ii)=Xn-si+Xn-ii (additive Lagged Fibonacci). Random 
numbers are generated from the initial set of // integer seeds, and to achieve the maximum 
period the long lag // should be set equal the base of a Mersenne exponent, and the 

short lag si should be taken so that the characteristic polynomial a;"+a;*'+l is primitive. In 
addition, si should not be too small or too close to II; it is recommended that sl^pxll, where 
p^O.618 [31]. When single precision arithmetic is used, the mod m operation can be omitted by 
setting m=2^^ (more on selection of parameters can be found in Ref. 3l|, |33|). We employed the 
additive Lagged Fibonacci RNG, which generates floating point variates directly, without the 
usual floating of random integers. Also, si and II can be taken to be very large, which improves 
statistical quality of the generator. 



III. GPU-BASED IMPLEMENTATION OF LCG, RAN2, HYBRID TAUS, AND 

LAGGED FIBONACCI ALGORITHMS 

A. Basic ideas 

The main feature that makes GPUs computationally efficient is their many-thread architec- 
ture, i.e. calculations are performed on a GPU using threads working in parallel. Hence, in 
molecular simulations of an body system on a GPU an RNG should produce independent 
random numbers simultaneously for all particles. One possibility is to have random numbers 
pre-generated on a CPU or on a GPU, and then use these numbers in simulations. However, this 
requires a large amount of memory allocated for an RNG. For example, for a system of 10^ par- 
ticles in three dimensions, 3x10^ random numbers are needed at each simulation step. If these 
numbers are pre-generated, say, for every 100—1000 steps, it requires 3x10^—3x10^ random 
numbers to be stored on the GPU, which takes 12— 120Mi? of memory. This might be signifi- 
cant for graphics cards with limited memory, e.g., GeForce GTX 200 series (from NVIDIA) with 
r^lGB of memory. Another approach is to build an RNG into the main simulation kernel. This 
allows one to achieve top performance for an RNG by maximizing the amount of computations 
on a GPU while also minimizing the number of calls of the GPU global memory (read/write 
operations). In addition, to fully utilize the GPU resources, the total number of threads should 
be ~10-times larger than the number of computational cores, so that none of the cores awaits 
for the others to complete their tasks. 

To develop parallelized implementations of several different RNGs on the GPU, we employ 
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cycle division paradigm 30|]. The idea is to partition a single RNG sequence, which can be 
viewed as a periodic circle of random numbers, among many computational threads running 
concurrently across the entire GPU device, each producing a stream of random numbers. Since 
most RNG algorithms are based on sequential transformations of the current state, including 
LCG, Hybrid Taus and Ran2, the most common way of partitioning the sequence is to provide 
each thread with different seeds while also separating the threads along the sequence to avoid 
possible inter-stream correlations. This is the basis of the one-RNG-per-thread approach (Fig. 1). 
Also, there exist RNG, e.g., Mersenne Twister and Lagged Fibonacci algorithms, that allow one 
to leap ahead in the sequence to produce the + random number without first computing the 



n-th number 



30 



32 



34| . The leap size, which, in general, depends on parameters of an RNG, 



can be adjusted to the number of threads (equal the number of particles A^), or multiples of 
N (MxN). Then, all N random numbers can be obtained simultaneously, i.e. the j-th thread 
produces numbers j, j+N, j+2N. . ., etc. Note that at the end of each simulation step, threads 
must be syncronized so that the current RNG state is properly updated. The same RNG state 
is used by all threads, each updating just one elements of the state. We refer to this as the 
one-RNG-for-all-threads approach (Fig. 1). In what follows, we describe these approaches in 
more detail. 



B. One-RNG-per-thread approach 

The idea is to run the same RNG algorithm in many threads, where all RNGs generate 
different subsequences of the same sequence of random numbers using the same algorithm, but 
starting from different initial seeds. First, a CPU generates sets of random seeds (one for each 
RNG) and passes them to the GPU global memory (Fig. 2). To exclude correlations, these sets 
should come from an independent sequence of random numbers, or should be generated using 
different RNG algorithms on the CPU. In a simulation run, each thread on the GPU reads its 
random seeds from the GPU global memory and copies them to the GPU local (per thread) 
memory or shared (per thread block) memory. Then, each RNG can generate as many random 
numbers as needed, without using the slow GPU global memory. At the end of a simulation step, 
each RNG saves its current state to the global memory and frees shared memory. Since each 
thread has its own RNG, there is no need for threads synchronization; however, when particles 
interact threads must be synchronized. In molecular simulations of a system of A^ particles, 4A^ 
uniformly distributed random variates are needed at each step, and arrays of initial seeds and 
the current state should be arranged for coalescent memory read to speedup the global memory 



10 



access. In the one-RNG-per-thread setting, an RNG should be very hght in terms of memory 
usage. Small size of on-chip memory can be insufficient to store the current state of an RNG 
that is based on a complex algorithm. These restrictions make it difficult to use simple RNG 
algorithms, especially when statistical properties of random numbers become an issue. 

In the one-RNG-per-thread approach, the amount of memory required to store the current 
state of a generator is proportional to the number of threads (number of particles A^). Hence 
a significant amount of memory has to be allocated for all RNGs to describe the dynamics 
for a large system. For example, LCG uses one integer seed to store its current state, which 
takes 4 bytes per thread (per generator) or ~4Mi? of memory for 10^ threads, whereas Hybrid 
Taus uses 4 integers, i.e. 16MB of memory. These are acceptable numbers, given hundreds of 
megabytes of the GPU memory. By contrast, Ran2 uses 35 long integers and a total of 280 
bytes per thread, or ~280Mi? of memory (for 10^ threads). As a result, not all seeds can be 
stored in on-chip (local or shared) memory {^16KB), and the Ran2 RNG has to access the 
GPU global memory to read and update its current state. In addition, less memory becomes 
accessible to other computational routines. This might prevent using Ran2 in the simulations 
of large systems on some graphics cards, including GeForce GTX 280 and GTX 295 (NVIDIA), 
with 768MB of global memory (per GPU). However, this is not an issue when using high end 
graphics cards, such as Tesla C1060 with AGB of global memory. In this paper, we utilized the 
one-RNG-per-thread approach to develop the GPU-based implementations of the Hybrid Taus 
and Ran2 algorithms (pseudocodes are presented in Appendix A). 



C. One-RNG-for-all-threads approach 

Within the one-RNG-for-all-threads approach, one can use a single RNG by allowing all 
computational threads to share the state of a generator. This approach can be adapted to RNG 
algorithms that are based on the recursive transformations, i.e., Xn=f{yn-r,yn-r+i,---yn-k), 
where r is the recurrence degree and k>r is a constant parameter. This transformation allows 
one to obtain a random number at the n-th step from the state variables generated at the previous 
steps n—r, n—r+1, . . ., n—k. If a sequence of random numbers is obtained simultaneously in 
threads, each generating just one random number at each step, then total of A^ random numbers 
are produced. Then, given k>N, all the elements of the transformation have been obtained in the 
previous steps, in which case they can be accessed without threads synchronization. One of the 
algorithms that can be implemented on the GPU using the one-RNG-for-all-threads approach 



is Lagged Fibonacci (Fig. 3) 3J]. When one random number is computed in each thread and 
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when sl>N and ll—sl>N (Section III E\} . N random numbers can be obtained simultaneously 
on the GPU without threads synchronization. 

To initialize the Lagged Fibonacci RNG on the GPU, // integers are allocated on the CPU 
using initial seeds. Each thread then reads two integers from this sequence, which correspond to 
the long lag // and the short lag si, generates the resulting integer, and saves it to the location 
in the GPU global memory, which corresponds to the long lag. Setting sl>N and ll—sl>N 
guarantees that the same position in the array of integers (current state variables) will not be 
accessed by different threads at the same time. The moving window of random numbers, 
updated by threads at each step, is circling along the array of state variables, leaping forward 
by positions (at each step). Importantly, a period of the Lagged Fibonacci generator, p~2""'"^^, 
can be adjusted to the system size by assigning large values to si and //, so that p^A^xS, where 
5* is the total number of simulation steps. Changing II and si does not influence the execution 
time, but affects the size of the array of state variables, which scales linearly with // - the amount 
of integers stored in the GPU global memory. Large // is not an issue even when ZZ~10^, which 
corresponds to ~4Mi? of the GPU global memory (pseudocode for the Lagged Fibonacci RNG 
is presented in Appendix B). Note, that the CPU-based implementation of the Lagged Fibonacci 
algorithm using the one-RNG-per-thread approach requires to store A^ independent RNG states 
of size //, i.e. A^ times larger memory. 

IV. APPLICATION-BASED TEST OF RANDOMNESS: ORNSTEIN-UHLENBECK 

PROCESS 

To assess the computational and statistical performance of the LCG, Ran2, Hybrid Taus, 
and Lagged Fibonacci algorithms in molecular simulations, we carried out Langevin simulations 
of A^ independent one-dimensional harmonic oscillators in a stochastic thermostat, fully imple- 
mented on the GPU. Each particle evolves on the harmonic potential, V{Ri)=kspR] /2, where 
Ri is the i-th particle position and kgp is the spring constant. We employed this analytically 
tractable model from statistical physics to compare the results of simulations with the theoreti- 
cal results that would be obtained with truly random numbers. In the test simulations, we used 
NVIDIA graphics card GeForce GTX 295, which has two processing units (CPUs), each with 
30 multiprocessors (total of 240 ALUs) [9] and 7Q8MB of global memory. 

The Langevin equations of motion in the overdamped limit. 



dRi 



dt 



dV{Ri,R2,...,RN) 
dRi 



(4) 
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were integrated numerically using the first-order integration scheme (in powers of the integration 
time step At) |35| . 

Ri{t + At) = Ri{t) + /(i?,(t))At/e + g,{t)^2k^fiKt, (5) 



where f{Ri)=—{dV{Ri,R2, ■ ■ ■ ,RN)/dRi) is the force acting on the i-th oscillator |36l-l38| . In 
Eq. Qi are the Gaussian distributed random variates (with zero mean and unit variance), 
which are transformed into the random forces Gi{t)=gi{t)^J2kBT^At. Langevin dynamics in the 



overdamped limit (Eqs. dl]) and (|5])) are widely used in the simulations of biomolecules 37H43| 



Numerical values of the constant parameters for the LCG, Ran2, Hybrid 



aus, and Lagged 
15] ■ in Ref. [l9|, in Appendix A, 



Fibonacci algorithms can be found, respectively, in Section II 
and in Table I. 

We employed the one-RNG-per-thread approach to develop the GPU-based implementations 
of the LCG, Ran2, and Hybrid Taus algorithms, and used the one-RNG-for-all-threads approach 
for the Lagged Fibonacci RNG. These implementations have been incorporated into the LD sim- 
ulation program written in CUDA. Numerical algorithms for the GPU-based implementation of 
LD simulations of biomolecules, which involves evaluation of the potential energy, calculation 
of forces, and numerical integration of the Langevin equations of motion, will be presented in 
a separate publication (A. Zhmurov, R. I. Dima, Y. Kholodov, and V. Barsegov, submitted 
to J. Chem. Theory and Comput.) In our implementation, each computational thread gen- 
erates one trajectory for each particle, and we used 64 threads in a thread block. Numerical 
calculations for A^=10'^ particles were carried out with the time step At=lps, starting from the 
initial position RQ=10nm, and using ksp=0.01pN/nm, T = 3007^^, and D = 0.25n'm'^ /ns. Soft 
harmonic spring {O.OlpN/nm) allowed us to generate long 1ms trajectories over 10^ steps. We 
analyzed the average position {R{t)) and two-point correlation function {R{t)R{0)), obtained 
from simulations, and have compared these quantities with their exact counterparts jl3l . [l^ . 
(i?(i:))=i?j(0)exp [— t/r] and {R(t)R{0)) = {kBT/ksp)exp[—t/T], respectively, where T=^/ksp is 
the characteristic time. All RNGs describe well the exact Brownian dynamics except for LCG 
(Fig. 4). Both {R(t)) and {R(t)R{0)), obtained using Ran2, Hybrid Taus, and Lagged Fibonacci, 
practically collapse on the theoretical curve of these quantities. By contrast, using LCG results 
in repeated patters of {R{t)) and unphysical correlations in {R{t)R{0)) (Fig. 4). At longer times, 
{R{t)) and {R{t)R{0)), obtained from simulations, deviate somewhat from the theoretical curves 
due to a soft harmonic spring and insufficient sampling (Fig. 4). 

In biomolecular simulations on a GPU, a large memory area should be allocated to store 
parameters of the force field, Verlet lists, interparticle distances, etc., and the memory demand 
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scales with the system size as r^N"^. In contemporary graphics cards, the amount of global 
memory is low, and each memory access takes ~300 clock cycles. The number of memory calls 
scales linearly with the amount of random numbers produced. Because the computational speed 
even of a fast RNG is determined mostly by the number of global memory calls, multiple reads 
and writes from and to the GPU global memory can prolong significantly the computational 
time. We profiled the LCG, and the Ran2, Hybrid Taus, and Lagged Fibonacci RNGs in terms 
of the number of global memory calls per simulation step. These generators use, respectively, 
1, 40, 4, and ~3 random seeds per thread (the state size for Lagged Fibonacci depends on the 
choice of parameters II and si). In our implementation, the LCG, and the Hybrid Taus and 
Lagged Fibonacci RNGs use 4—16 bytes of memory per thread, which is quite reasonable even 
for large system size A'^=10^. However, Ran2 requires 280 bytes per thread which is significant 
for a large system (Table II). Ran2 has large state size, and saving and updating its current 
state using the GPU local or shared memory is not efficient computationally. Ran2 uses long 
64-bit variables, which doubles the amount of data, and requires 4 read and 4 write calls (7 
read and 7 write memory calls are needed to generate 4 random numbers). The Hybrid Taus 
RNG uses the GPU global memory only when it is initialized, and when it updates its current 
state. Since it uses 4 state variables, 4 read and 4 write memory calls per thread are required 
irrespectively of the amount of random numbers (Table II). The Lagged Fibonacci RNG uses 
2 random seeds, which results in 2 read and 1 write memory calls per random number, and 8 
read and 4 write calls for four random numbers (Table II). 

To benchmark the computational efficiency of the LCG, and the Ran2, Hybrid Taus, and 
Lagged Fibonacci RNGs, we carried out LD simulations of N three-dimensional harmonic os- 
cillators in a stochastic thermostat. For each N, we generated one simulation run over ?t,=10^ 
steps. All threads have been synchronized at the end of each step to emulate an LD simu- 
lation run of a biomolecule on a GPU. The execution time and memory usage are displayed in 
Fig. 5. Ran2 is the most demanding generator: the use of Ran2 in LD simulations of a system 
of 10^ particles adds extra ~264 hours of wall-clock time to generate a single trajectory over 10^ 
steps (on NVIDIA GeForce GTX 295 graphics card). The memory demand for Ran2 is quite 
high (>250MB for A"=10^). In addition, implementing Ran2 on the GPU does not lead to 
a substantial speedup compared to the CPU-based implementation (Fig. 5). By contrast, the 
Hybrid Taus, and Lagged Fibonacci RNGs perform almost equally well in terms of the compu- 
tational time and memory usage (Fig. 5). These generators require a small amount of memory 
(<15— 20MS) even for a large system of 10^ particles (data not shown). 
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V. DISCUSSION AND CONCLUSION 

Increasing the computational speed of a single CPU core becomes more and more challeng- 
ing for CPU manufacturers. With accelerated working frequency of modern CPUs, high power 
throughput results in CPU overheating, which prohibits unlimited growth in their computational 
power. In this regard, graphics processors are emerging as an alternative type of computing de- 
vices that evolve through increasing the number of computational cores rather than working 
frequency of a few cores. The highly parallel architecture of the GPU device provides an alter- 
native computational platform that allows one to utilize multiple ALUs on a single processor. 
However, this comes at a price of having smaller cache memory and reduced flow control. Hence, 
to harvest raw computational power offered by the GPU in a particular application, one has 
to re-design computational algorithms that have been used on the CPU for several decades. 
The programmer has to be able to decompose each computational task into many independent 
threads of execution. In addition, care has to be taken to ensure coalescent memory access, and 
proper threads synchronization and communication. 

Random number generators (RNGs) are needed for most of computer applications such as 
simulations of stochastic systems, probabilistic algorithms, and numerical analysis among others. 
We described the one-RNG-per-thread approach and the one-RNG-for-all-threads approach for 
random number generation on the GPU (Fig. 1), which we applied to the LCG, and to the Ran2, 
Hybrid Taus, and Lagged Fibonacci generators. We have tested these RNGs using Langevin 
simulations of independent Brownian particles, evolving on the harmonic potential. The 
LCG, Hybrid Taus, and Ran2 algorithms were realized on the GPU as independent RNGs 
producing many streams of random numbers at the same time (one-RNG-per-thread approach, 
Fig. 2). Additive Lagged Fibonacci algorithm was implemented using many threads generating 
a single sequence of random numbers (one-RNG-for-all-threads approach. Fig . 3). The Hybrid 
Taus and Lagged Fibonacci algorithms of good statistical quality [isl [2o[ Isil provide random 
numbers at a computational speed almost equal to that of the Quick and Dirty LCG (Fig. 4), 
and the associated memory demand is rather low (Figs. 5). Their long periods are sufficient 
to describe stochastic dynamics of a very large system (A^>10^ particles) on a long timescale 
(n>10^ simulation steps). This makes the Hybrid Taus and Lagged Fibonacci algorithms a very 
attractive option for molecular simulations of biomolecules on the GPU. Ran2 is a well tested 



generator of proven statistical quality (Fig. 4) [19||. It is probably the best RNG choice for 
molecular simulations on the CPU, but it works almost ten-fold slower on the GPU and requires 
large memory area (Fig. 5). Because using Ran2 in the molecular simulations of large systems 
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can decrease significantly the computational speed of numerical modeling, Ran2 can be used in 
the simulations of small systems (A^<10^ particles). 

Statistical characteristics of random numbers, generated by using the one-RNG-per-thread 
approach, do not improve with the increasing system size. In this setting, each RNG working in 
each thread uses its own state and, hence, increasing the number of threads (number of particles 
A^) results in the increased number of generators, but it does not improve their statistical 
qualities. In the one-RNG-for-all-threads approach, streams of random numbers are produced 
in many threads running in parallel and sharing the same state variables. As a result, statistical 
properties of the random numbers improve with the increasing size of the RNG state. This 
is a general property of RNG implementation based on the one-RNG-for-all-threads approach 
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31| . For this reason, we recommend the Lagged Fibonacci RNG for compute- intensive 



LD simulations and MD simulations in implicit solvent of large biomolecules, and in parallel 
tempering algorithms including variants of the replica exchange method. Also, in the one-RNG- 
for-all-threads approach only one sequence on random numbers is generated, which makes is 
possible to compare directly the results of simulations on the CPU and on the GPU. This can 
be used in benchmark tests to estimate numerical errors due to single precision floating point 
arithmetic, rounding-off errors, or to identify bad memory reads on the GPU. 

Profiling the computational performance of the Hybrid Taus and Lagged Fibonacci generators 
have revealed that for these RNGs the execution time scales sublinearly with A^ (i.e. remains 
roughly constant) for A^<5xl0^ due to insufficient parallelization, but grows linearly with A^ 
for larger sistems when all ALUs on the GPU become fully loaded (Fig. 6). Analysis of the 
execution time for Hybrid Taus and Lagged Fibonacci (RNG time) with the time of generation 
of deterministic dynamics, i.e. without the Gaussian random forces (dynamics without RNG 
time), shows that it takes slightly longer to generate random numbers than to propagate the 
dynamics to the next time step (Fig. 6). This is a pretty high performance level rendering the 
fact that the potential energy function used in our model simulations does not involve long-range 
interactions (Lennard- Jones type potential). Using the Hybrid Taus and Lagged Fibonacci RNG 
leads to a substantial 25— 35-fold speedup, as compared to the CPU-based implementation of 
these generators within the same LD algorithm. Given higher statistical quality of the Hybrid 
Taus and Lagged Fibonacci RNGs, these generators is a reasonable choice for the CPU-based 
implementations of molecular simulations (Fig. 5). Hybrid Taus allows one to obtain faster 
acceleration, compared to Lagged Fibonacci, but the latter has an important advantage over 
the former, namely, that it can be ported to new graphics cards that utilize Multiple Instruction 
Multiple Data (MIMD) architecture 4^. We also applied stringent statistical tests of random- 
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ness to access the statistical properties of random numbers produced by using our GPU-based 
implementation of the Lagged Fibonacci RNG. We found that even when a small short lag 



s/=1252 is used, this RNG does not fail a single tests in the DIEHARD test suite 25|], and 
passes the BigCrush battery of tests in the TestUOl package jisl - 



In conclusion, the development of new Fermi architecture (NVIDIA) 44j and Larrabee archi- 



tecture (Intel) 451], both equipped with 512 ALUs, is an important next step for general purpose 
GPU computing. These next generation processors will utilize MIMD protocol, which will enable 
a developer to use many ALUs in independent computations so that different cores can perform 
concurrently different computational procedures on multiple data sets. Also, high speed inter- 
connection network will provide a fast interface for threads communication. These advances 
in computer architecture will enable the programmer to distribute a computational workload 
among many cores on the GPU more efficiently, and to reach an even higher performance level. 
In a context of MD simulations in implicit solvent and in LD simulations, it will become possible 
to compute random forces using much needed threads synchronization over the entire processor. 
This makes the one-RNG-for-all-threads approach, where thread synchronization is utilized, all 
the more relevant as it will allow one to obtain additional acceleration on the GPU device gain- 
ing from high speed threads communication. Importantly, the GPU-based implementation of 
the Lagged Fibonacci RNG, developed here, could be ported to new graphics processors with a 
few minor modifications. In addition, the one-RNG-for-all-threads approach to random number 
generation can also be used to develop GPU-based implementations of the Mersenne Twister 



RNG, one of the most revered generators 



4^48 



48|, and several other generators, including mul- 



tiple recursive (MRG) and linear/generalized shift feedback register (LSFR/GSFR) generators. 



such as 4-lag Lagged Fibonacci algorithm [15 



32| . Work in this direction is in progress. 
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Appendix A: One-RNG-per-thread approach: Hybrid Taus and Ran2 

In the pseudocodes, that describe the GPU-based implementation of Hybrid Taus and Ran2 
RNGs, superscript h is used to denote the host (CPU) memory, whereas superscript d indicates 
data stored in the device (GPU) global memory. Also, a section of the code executed on the GPU 
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is in the same listing as the code for the CPU. In CUDA implementations, the corresponding 
code for the GPU device is in a separate kernel. 
Algorithm 1: Hybrid Taus algorithm. 

Require: y^[N], yl^[N], y^[N] and yl[N] allocated in CPU memory 
Require: yf\N], y2[N], yi\N] and yi[N] allocated in GPU global memory 

1. yi[l... N] to y^l[l . . . N] ^ initial seeds 

2. y^[l ... AT] to ?/^[l . . . A^] ^ yi[l ... A^] to yfyl ...N] {copying initial seeds to GPU} 
Beginning of GPU code section 

3- 3th ^ thread index 

4. Z/i, 1/2, Z/3 and ?/4 yflith], yHith], vtiith] and yHith] {loading the state} 

5. for i = 1 to 4; i + + do {generating four random numbers} 

6. 6 ^ (((yi < cii) XOR yi) > C2i) 

7. ?/i^(((yiANDci)«C3i)XOR6 

8. (((i/2 < C12) XOR y2) > C22) 

9. ?/2 ^ (((1/2 AND C2) < C32) XOR h 

10. h ^ (((1/3 < C13) XOR ^3) > C23) 

11. ^3 ^ (((2/3 AND C3) < C33) XOR h 

12. ?/4 a?/4 + c 

13. Output mult X ( XOR ?/i XOR ?/2 XOR 2/3 XOR y^) 

14. end for{generating next random number} 

15. yu 2/2, 2/3 and 2/4 ?/i[jt/,], yi[3th], vilith] and [jt/,] {saving the current state} 
End of GPU code section 

In this listing, 6 is a temporary unsigned integer variable, ?/2, 2/3, and y^ are unsigned integer 

random seeds for three Tausworthe generators (lines 6—11) and one LCG (line 12). XOR is a 

binary operation of exclusive disjunction and and "<^" denote binary shift to the right 

and to the left, respectively. In the pseudocode, mM/t=2. 3283064365387x10^^*^ is a multiplier 

that converts a resulting integer into a floating point number, Cii=13, C2i=19, C3i=12, C2i=2, 

C22 =25, C23=4, C3i=3, C32=ll, C33=17, Ci=4294967294, co = 4294967288, and C3=4294967280 are 

constant parameters for three Tausworthe generators 29|, and a=1664525 and c=1013904223 

□ 

are constant parameters for the LCG [19l |. 
Algorithm 2: Ran2 algorithm. 

Require: idumP-\N]^ idu'm2^[N]^ iy^[N] and iv^[N * NTAB] allocated in CPU memory 
Require: idum'^[N], idu'm2''-[N], iy'^[N] and iv'^[N * NT AB] allocated in GPU global memory 
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1. idum^[l . . .N] ■<— initial seeds 

2. for i — to N — 1] i -\- -\- do {loading all N generators} 

3. idum2'^[i] idum'^[i] 

4. for j = NTAB + 7 to 0; j - - do 

5. k idum''[i]/IQl 

6. idum^\i] ^ lA* {idum''\i] - k * IQl) - k * IRl 

7. if idum'^[i\ < then 

8. idum'^[i] — idum'^[i] + I Ml 

9. end if 

10. if j < NTAB then 

11. iv''[i* NTAB + j] = idum^\i] 

12. end if 

13. end for 

14. end forjall N generators are intialized} 

15. iduwP' — >■ idumf'] idum2^ — > idum2'^] iy^ — >■ iy"^; iv^ — )■ iv"^ {copying to GPU} 

16. for f = 0to5';i + +do {starting simulation for S steps} 
Beginning of GPU code section 



17. jth ^ thread index 

18. idum idum'^\jth\] idum2 idu'm2'^\jth\', iy W'^Uth] {copying to GPU local mem- 
ory} 

19. x[A\ {output vector for four random numbers} 

20. for i = to 4; i + + do {generating four random numbers} 

21. k idum/ IQl; idum lAl * {idum — k* IQl) — k* IRl 

22. if idum < then 

23. idum = idum + I Ml 

24. end if 

25. k idum2/IQ2; idum2 IA2 * {idum2 -k* iQ2) -k* IR2 

26. if idum2 < then 

27. idum2 = idum2 + IM2 

28. end if 

29. j ^ iy/NDIV 

30. iv iv'^[jth * NTAB + j] {portion of the RNG state in GPU global memory} 

31. iy — iv — idum2; idum iv'^[jth * NTAB + j] 

32. if < 1 then 
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33. iy^iy + IMMl 

34. end if 

35. tempran •<— AM * iy 

36. if tempran > RNMX then 

37. x[i\ ^ RNMX 

38. else 

39. x[i] •<— tempran 

40. end if 

41. end for {generating next random number} 

42. idum — >■ idum'^\jth\] idum2 — >■ idum2'^\jth\', iy — >■ iy^\jth] {saving to GPU global mem- 
ory} 

43. Output: X 



End of GPU code section 

44. end for{next simulation step} 

Once N RNGs are initialized on the CPU (lines 1—14), initial seeds for all generators are 
copied to the GPU global memory (line 15). The GPU-based computations start on line 16. 

Each thread locates the values of the RNG state in the GPU global memory using thread index 
and copies the values of variables idum, idum2 and iy to the GPU local memory (line 18). 
Array iv is accessed via GPU global memory calls (lines 30 and 31). Each thread generates four 
random numbers (cycle starting on hue 20) and saves them to array x[4]. Gurrent RNG state 
variables are updated in the GPU global memory (line 42). 

Appendix B: One-RNG-for-all-threads approach: Lagged Fibonacci 

Algorithm 3: Additive Lagged Fibonacci algorithm. 
Require: x^[N] allocated in GPU global memory 

1. x'''[l . . . II] -ir- initial seeds 

2. for i = to do {starting simulations} 
Beginning of GPU code section 



3. jth ^ thread index 

4. sMfto ^ Uth + N*t)* RNS 

5. for shift — shifto to shifty + RNS — 1 do 

6. Xii ■<— x'^[shift mod 11] 

7. Xsi ^ x'^[{shift + si — II) mod U] 



20 



8. X •<— {xii op Xsi) mod m 

9. output X 

10. X x'^[shift mod U\ 

11. end for 

End of GPU code section 

12. end for 

To initialize a RNG, a CPU fills in II integer random seeds into x'^ arrays and copies them to the 
GPU (line 1), where each thread computes the location (shiftO) of an integer that corresponds 
to the location of the first random number to be produced. This is done using the current 
simulation step (t), thread index {jth), the total number of threads (iV), and the amount of 
random numbers needed at each step (RNS). Lines 6—10 are repeated until RNS random 
numbers are generated (cycle starting on line 5) using addition operator op (line 8). For every 
random number, two integers from the RNG state have to be gathered (lines 6 and 7). These 
integers correspond to the long lag II and the short lag si. Locations in the array of integers 
are modulo II, which represents "cycling" through the array of state integers starting from the 
beginning of the array (when its end is reached). The resulting integer x (line 8) is reported 
(line 9) and saved for the next steps (line 10). When RNS random numbers are needed in each 
thread at each step of a simulation, sl>RNSxN and U — sl>RNSxN (total number of integers 
updated at each step is RNSxN). 
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FIGURE CAPTIONS 

Fig. 1. Flowchart for generation of random numbers using the one-RNG-per-thread approach 
(panel a) and the one-RNG-for-all-threads approach (panel b). In the one-RNG-per-thread 
setting, independent RNGs (for particles) are running concurrently in computational 
threads on the GPU device generating random numbers from the same sequence, but starting 
from different sets of initial seeds. Within the one-RNG-for-all-threads approach, a single RNG 
is used by all threads running in parallel on the GPU sharing one set of seeds and producing 
A^ subsequences of the same sequence of random numbers. The computational workflow is 
indicated by the arrows, and n, n + 1, . . . are the simulation steps. 

Fig. 2. GPU-based realization of the one-RNG-per-thread approach. The arrows represent the 
direction of computational workflow and data transfer. To launch an RNG on the GPU, A^ sets 
of initial random seeds, one set per thread of execution (per particle) generated on the CPU, 
are transferred to the GPU global memory. Each thread reads corresponding seeds from the 
GPU global memory, and generates random numbers for just one step of a simulation using a 
particular RNG algorithm. When all random numbers have been produced at the n-th step, 
each thread saves its RNG state to the GPU global memory so that it could be used at the next 
step {n + 1). 

Fig. 3. GPU-based realization of the one-RNG-for-all-threads approach and parallel implemen- 
tation of the Lagged Fibonacci algorithm using the cycle division paradigm. The state of the 
Lagged Fibonacci RNG is represented by the circle of II integers. Initial seeds are generated 
on the CPU and copied to the GPU global memory. Generation of A^ random numbers is done 
simultaneously in A^ threads using Eq. (|3]) (shown by arrows). The obtained random numbers 
are saved to update the RNG state for future use. A grid of computational threads is moving 
along the same sequence of random numbers, each time rewriting A^ state variables that appear 
// positions earlier in the sequence. The dark grey squares represent the state variables, and 
the updated portion of the RNG state at a given step; the black "zero line", which denotes the 
position of the first thread, shifts forward by A^ positions at every next step. 

Fig. 4. Semilogarithmic plots of the average particle position {X{t)) (panels a and b) and two- 
point correlation function C{t) = {X{t)X{0)) (panel c) for a system of A^ harmonic oscillators in 
a stochastic thermostat. Theoretical curves of {X{t)) and C{t) are compared with the results 
of Langevin simulations obtained using the LCG, Hybrid Taus, Ran2, and Lagged Fibonacci 
algorithms. Equilibrium fluctuations of {X{t)) on a longer timescale, obtained using LCG, 
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are magnified in panel b, where one can observe a repeating pattern due to the inter-stream 
correlations among N streams of random numbers. 

Fig. 5. Computational performance of the GPU-based implementations of the LCG, Ran2, 
Hybrid Taus, and Lagged Fibonacci algorithms in LD simulations of N three-dimensional har- 
monic oscillators in a stochastic thermostat (color code is explained in the graphs). Panel a: 
A logarithmic plot of the execution time (per 10^ steps) as a function of the system size N. 
Threads have been synchronized on the CPU at the end of each step to imitate the LD sim- 
ulations of a biomolecule. As a reference, also shown is the simulation time with the Ran2 
algorithm implemented on the CPU. Panel b: Memory demand, i.e. the amount of memory 
needed for an RNG to store its current state, as a function of N. A step-wise increase in the 
memory usage for Lagged Fibonacci at A^f=:i0.6xl0^ is due to change in the values of constant 
parameters (Table I). 

Fig. 6. Computational time (per 10^ steps) for an end to end apphcation of LD simulations of N 
three-dimensional harmonic oscillators in a stochastic thermostat using the Hybrid Taus (panel 
a) and Lagged Fibonacci algorithm (panel b), as a function of A'^ (color code is explained in 
the graphs). The simulation time for the full LD algorithm (Langcvin Dynamics) is compared 
with the time for generating random numbers (RNG) and with the time required to obtain 
deterministic dynamics without random numbers (Dynamics w/o RNG). The computational 
speedup (CPU time/GPU time) is displayed in the insets. 
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TABLE I: Constant parameters, i.e. the short lag si and the long lag II, for the Lagged Fibonacci 
RNG for molecular simulations of a system of size (taken from Ref. 0,13). 



N 


<1252 


<3004 


<5502 


< 10095 


< 12470 


<23463 


< 54454 


<279695 


< 288477 


<1010202 


si 


1252 


3 004 


5 502 


10 095 


12 470 


23463 


54454 


279 695 


288477 


1010 202 


11 


2 281 


4423 


9 689 


19 937 


23 209 


44497 


132 049 


756 839 


859433 


3 021377 



TABLE II: Memory usage (in bytes/thread) , and the number of GPU global memory calls, i.e. 
the numbers of read/write operations per one random number (Mi) and for four random numbers 
(M2), for generation of random numbers on the GPU at each step using the LCG, and the Hybrid 
Taus, Ran2, and Lagged Fibonacci RNGs. In molecular simulations, four random numbers are 
needed at each step to generate three (x, y, and z) components of the Gaussian random force per 
particle. 



Parameter 


LCG 


Hybrid Taus 


Ran2 


Lag 


ged Fibonacci 


bytes / thread 


4 


16 


280 




12 


Ml 


1/1 


4/4 


4/4 




3/1 


M2 


1/1 


4/4 


7/7 




12/4 
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Figure 3. (A. Zhmurov, K. Rybnikov, Y. Kholodov, V. Barsegov) 
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